I compared the sugar and sodium content of the 7 biggest American cereal manufacturers.
+ Made an animated plot with gganimate to compare the 7 biggest American cereal manufacturers.
+ Wrote a user-defined function that returns the sugar and sodium content.
POV: It’s early December and you’re sitting on your living room floor with coffee. Your pet sashays past the stack of books lying beside you, food bowl in their mouth, clearly asking you to feed them. Immersed in your laptop, practicing R, you get up and fill their bowl, but as you stand in the kitchen you find out there is no human food left in the house. You put on your jacket and leave for the grocery shop. Standing before the breakfast cereal aisle, you are reluctant to decide which cereal to pick. A voice in your head goes “I could write an R script for this”. And so, you bow your head and leave the shop with just enough to get you through the weekend.
During the evening, however, when the sound of rain fills your ears and the fur of your pet brushes against your leg, you find yourself in the living room again, reach for your laptop, and start typing.
## Load required packages
if (!require(tidyverse)) install.packages("tidyverse") # data wrangling and visualization
if (!require(tidytext)) install.packages("tidytext") # tidy handling of text data
if (!require(ggplot2)) install.packages("ggplot2") # data visualization with plots
if (!require(magick)) install.packages("magick") # putting animated graphs side-by-side
if (!require(readxl)) install.packages("readxl") # imports Excel files
if (!require(gifski)) install.packages("gifski") # renders gganimate plots as gifs
if (!require(knitr)) install.packages("knitr") # dynamic report generation
if (!require(dplyr)) install.packages("dplyr") # tidy data manipulation
if (!require(here)) install.packages("here") # builds file paths relative to the project root
if (!require(car)) install.packages("car") # Levene statistical test
if (!require(png)) install.packages("png") # reads PNG images
if (!require(DT)) install.packages("DT") # interactive tables
## installation of devtools is necessary prior to installing the following packages
devtools::install_github("thomasp85/gganimate") #makes gif-like graphs
library(gganimate)
devtools::install_github("thomasp85/patchwork") #allows for combining graphs
library(patchwork)
devtools::install_github("ropensci/plotly") #interactive graphs
library(plotly)
## installation of remotes is necessary prior to installing the following package
remotes::install_github("wilkelab/ggtext") #customization of graph text using HTML rendering
library(ggtext)
## Load in the dataset with base R's read.csv(), using here() to build a portable file path
cereal = read.csv(here::here("0_data",
"raw_data",
"cereal.csv"))
An important step of data prepping is cleaning and tidying the data to ensure the following:
+ Each variable has its own column (tidy data).
+ All column names are easily understood.
+ The data types of the variables are correct.
The process of prepping data includes data cleaning. Here I clean the data by ensuring every variable has a sufficiently clear name, and that all values are written out as words rather than initials.
The biggest reason for this is that the analysis will generate plots. A graph and what it conveys should be understood on its own, without having to refer back to the source material time and time again. Moreover, imagine looking back at your analysis a few months from now and no longer understanding your own work because you haven’t gathered enough information in one place.
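To make that cleaning step concrete, here is a minimal base-R sketch of the two operations described above, run on a tiny made-up data frame. The short column names `name` and `mfr`, the one-letter manufacturer codes, and the example values are assumptions for illustration, not taken from the analysis itself:

```r
## Sketch only: toy stand-in for the raw cereal file
cereal_raw <- data.frame(
  name   = c("Apple Jacks", "Cheerios"),
  mfr    = c("K", "G"),
  sugars = c(14, 1)
)

## 1. Give columns names that are easy to understand
names(cereal_raw)[names(cereal_raw) == "name"]   <- "cereal_name"
names(cereal_raw)[names(cereal_raw) == "mfr"]    <- "manufacturer"
names(cereal_raw)[names(cereal_raw) == "sugars"] <- "sugars_g"

## 2. Spell out initials as full words via a named lookup vector
mfr_lookup <- c(K = "Kelloggs", G = "General Mills")
cereal_raw$manufacturer <- unname(mfr_lookup[cereal_raw$manufacturer])
```

The named-vector lookup scales to any number of codes and keeps the mapping in one visible place instead of a chain of if-statements.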
## Change data from dataframe to tibble
cereal_tbl = as_tibble(cereal)
## let's take a look
DT::datatable(cereal_tbl)
Before performing any statistical tests, I first check whether the data distribution is normal.
In frequentist statistics it is common to check whether the data follows a normal distribution, because many tests assume one. In general, frequentist statistics views population parameters as unknown yet fixed: there is always a fixed average, with some samples falling above it and some below.
The first step is therefore checking whether our data is normally distributed. For the Shapiro-Wilk test, the null hypothesis (H0) states that the data IS normally distributed, while the alternative hypothesis (H1) states that it is not.
The output of a normality test (Shapiro-Wilk) includes a number called a P-value. When the P-value is 0.05 or above, we fail to reject H0 and treat the data as consistent with a normal distribution. When it is below 0.05, we reject H0 and conclude the data is NOT normally distributed. Strictly, a non-normal result means tests that assume normality should not be trusted at face value and one might stop the analysis there (I will not be doing that in this instance).
## Normality test for SUGAR
p_sugar = shapiro.test(cereal_tbl$sugars_g)
## Normality test for SODIUM
p_sodium = shapiro.test(cereal_tbl$sodium_mg)
## Let's have a look at the P-values
p_sugar
##
## Shapiro-Wilk normality test
##
## data: cereal_tbl$sugars_g
## W = 0.95247, p-value = 0.005923
p_sodium
##
## Shapiro-Wilk normality test
##
## data: cereal_tbl$sodium_mg
## W = 0.92904, p-value = 0.0003456
## Check for variance in the SUGAR values
levene_sugar = leveneTest(cereal_tbl$sugars_g, as.factor(cereal_tbl$manufacturer), center = mean)
## Check for variance in the SODIUM values
levene_sodium = leveneTest(cereal_tbl$sodium_mg, as.factor(cereal_tbl$manufacturer), center = mean)
## Let's see
levene_sugar
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 6 1.3471 0.2483
## 70
levene_sodium
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 6 3.1152 0.009188 **
## 70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We’ve now seen that the data isn’t normally distributed. Strictly speaking this violates an assumption of the parametric tests below, but for the purposes of this exercise I have chosen to continue; a non-parametric test would be the more cautious route.
The Levene test shows that for sugar the variance between groups does not differ significantly (p = 0.25), but for sodium it does (p = 0.009). In other words, the sugar values are similarly spread across manufacturers, while the spread of the sodium values differs between them.
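For completeness: when the normality tests above reject normality, a common non-parametric alternative to the one-way ANOVA is the Kruskal-Wallis test from base R's stats package, which compares groups by ranks instead of means. This sketch on made-up data is not part of the original analysis, just an illustration of what that check could look like:

```r
## Sketch: a rank-based alternative that does not assume normality
## (toy data stands in for cereal_tbl; values are invented)
set.seed(42)
toy <- data.frame(
  manufacturer = rep(c("Kelloggs", "Post", "Nabisco"), each = 10),
  sugars_g     = c(rpois(10, 8), rpois(10, 9), rpois(10, 3))
)
kw <- kruskal.test(sugars_g ~ manufacturer, data = toy)
kw$p.value  # a value below 0.05 would suggest at least one manufacturer differs
```

Like the ANOVA, a significant Kruskal-Wallis result says only that some group differs, not which one; a pairwise follow-up (e.g. `pairwise.wilcox.test()`) would play the role TukeyHSD plays below.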
With a one-way ANOVA test we put the manufacturers side by side and determine whether there is a relationship between them and sugar content. We have more than two groups (seven), so we choose ANOVA rather than a t-test. The outcome of this test is also a P-value, and it tells us whether the differences we observe are statistically significant. In other words: “Do they really differ?”
If the P-value is below 0.05, we reject the H0 hypothesis (no difference between manufacturers) and say that there is indeed a significant difference in sugar content when comparing the 7 manufacturers.
A way to build on the ANOVA test is the TukeyHSD function, which considers every pairwise combination of manufacturers and reports an adjusted p-value (p adj) for each difference.
## Perform a one-way ANOVA to test for a relationship between manufacturer and sugar content
## ANOVA test with SUGAR
sugar_anova = cereal_tbl %>% aov(sugars_g ~ manufacturer, data = .) %>% summary
## Which pairwise comparisons drive the significant difference?
sugar_tukey = cereal_tbl %>% aov(sugars_g ~ manufacturer, data = .) %>% TukeyHSD()
## let's get some perspective
head(sugar_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## manufacturer 6 262.2 43.69 2.468 0.0319 *
## Residuals 70 1239.4 17.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(sugar_tukey)
## $manufacturer
## diff lwr upr
## General Mills-American Home Food Products 4.9545455 -8.1061458 18.0152367
## Kelloggs-American Home Food Products 4.5652174 -8.4831234 17.6135581
## Nabisco-American Home Food Products -1.1666667 -14.9637403 12.6304069
## Post-American Home Food Products 5.7777778 -7.6867874 19.2423429
## Quaker Oats-American Home Food Products 2.2500000 -11.2984573 15.7984573
## Ralston Purina-American Home Food Products 3.1250000 -10.4234573 16.6734573
## Kelloggs-General Mills -0.3893281 -4.1986229 3.4199667
## Nabisco-General Mills -6.1212121 -12.0043041 -0.2381202
## Post-General Mills 0.8232323 -4.2310772 5.8775419
## Quaker Oats-General Mills -2.7045455 -7.9782753 2.5691844
## Ralston Purina-General Mills -1.8295455 -7.1032753 3.4441844
## Nabisco-Kelloggs -5.7318841 -11.5875062 0.1237381
## Post-Kelloggs 1.2125604 -3.8097483 6.2348691
## Quaker Oats-Kelloggs -2.3152174 -7.5582858 2.9278510
## Ralston Purina-Kelloggs -1.4402174 -6.6832858 3.8028510
## Post-Nabisco 6.9444444 0.2121619 13.6767270
## Quaker Oats-Nabisco 3.4166667 -3.4818701 10.3152035
## Ralston Purina-Nabisco 4.2916667 -2.6068701 11.1902035
## Quaker Oats-Post -3.5277778 -9.7346356 2.6790801
## Ralston Purina-Post -2.6527778 -8.8596356 3.5540801
## Ralston Purina-Quaker Oats 0.8750000 -5.5118040 7.2618040
## p adj
## General Mills-American Home Food Products 0.90941564
## Kelloggs-American Home Food Products 0.93692600
## Nabisco-American Home Food Products 0.99997464
## Post-American Home Food Products 0.84845188
## Quaker Oats-American Home Food Products 0.99872162
## Ralston Purina-American Home Food Products 0.99217422
## Kelloggs-General Mills 0.99992259
## Nabisco-General Mills 0.03608643
## Post-General Mills 0.99885466
## Quaker Oats-General Mills 0.70963334
## Ralston Purina-General Mills 0.93933394
## Nabisco-Kelloggs 0.05896520
## Post-Kelloggs 0.99002273
## Quaker Oats-Kelloggs 0.83041483
## Ralston Purina-Kelloggs 0.98050196
## Post-Nabisco 0.03883208
## Quaker Oats-Nabisco 0.74179283
## Ralston Purina-Nabisco 0.49462299
## Quaker Oats-Post 0.60168469
## Ralston Purina-Post 0.85085020
## Ralston Purina-Quaker Oats 0.99957403
What we can see is that the Nabisco-General Mills comparison shows a significant difference in sugar content per serving in grams (p adj = 0.036), and the same holds for Post-Nabisco (p adj = 0.039).
## Perform a one-way ANOVA to test for a relationship between manufacturer and sodium content
## ANOVA test with SODIUM
sodium_anova = cereal_tbl %>% aov(sodium_mg ~ manufacturer, data = .) %>% summary
## Which pairwise comparisons drive the significant difference?
sodium_tukey = cereal_tbl %>% aov(sodium_mg ~ manufacturer, data = .) %>% TukeyHSD()
## let's get some perspective
head(sodium_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## manufacturer 6 206474 34412 7.352 4.05e-06 ***
## Residuals 70 327643 4681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(sodium_tukey)
## $manufacturer
## diff lwr upr
## General Mills-American Home Food Products 200.454545 -11.9020580 412.811149
## Kelloggs-American Home Food Products 174.782609 -37.3731847 386.938402
## Nabisco-American Home Food Products 37.500000 -186.8296029 261.829603
## Post-American Home Food Products 146.111111 -72.8121647 365.034387
## Quaker Oats-American Home Food Products 92.500000 -127.7872970 312.787297
## Ralston Purina-American Home Food Products 198.125000 -22.1622970 418.412297
## Kelloggs-General Mills -25.671937 -87.6080847 36.264211
## Nabisco-General Mills -162.954545 -258.6090095 -67.300081
## Post-General Mills -54.343434 -136.5225464 27.835678
## Quaker Oats-General Mills -107.954545 -193.7012595 -22.207831
## Ralston Purina-General Mills -2.329545 -88.0762595 83.417169
## Nabisco-Kelloggs -137.282609 -232.4904347 -42.074783
## Post-Kelloggs -28.671498 -110.3303005 52.987305
## Quaker Oats-Kelloggs -82.282609 -167.5307911 2.965574
## Ralston Purina-Kelloggs 23.342391 -61.9057911 108.590574
## Post-Nabisco 108.611111 -0.8505268 218.072749
## Quaker Oats-Nabisco 55.000000 -57.1648015 167.164801
## Ralston Purina-Nabisco 160.625000 48.4601985 272.789801
## Quaker Oats-Post -53.611111 -154.5297548 47.307533
## Ralston Purina-Post 52.013889 -48.9047548 152.932533
## Ralston Purina-Quaker Oats 105.625000 1.7805723 209.469428
## p adj
## General Mills-American Home Food Products 7.679218e-02
## Kelloggs-American Home Food Products 1.748896e-01
## Nabisco-American Home Food Products 9.986733e-01
## Post-American Home Food Products 4.079992e-01
## Quaker Oats-American Home Food Products 8.610795e-01
## Ralston Purina-American Home Food Products 1.058934e-01
## Kelloggs-General Mills 8.682295e-01
## Nabisco-General Mills 4.220903e-05
## Post-General Mills 4.193706e-01
## Quaker Oats-General Mills 5.053407e-03
## Ralston Purina-General Mills 1.000000e+00
## Nabisco-Kelloggs 7.831643e-04
## Post-Kelloggs 9.358931e-01
## Quaker Oats-Kelloggs 6.546524e-02
## Ralston Purina-Kelloggs 9.808155e-01
## Post-Nabisco 5.314904e-02
## Quaker Oats-Nabisco 7.505691e-01
## Ralston Purina-Nabisco 8.698145e-04
## Quaker Oats-Post 6.747117e-01
## Ralston Purina-Post 7.048225e-01
## Ralston Purina-Quaker Oats 4.361868e-02
The same statistical test is used to check for a significant difference in sodium content (given in milligrams per serving) per manufacturer. The ANOVA itself is highly significant (p ≈ 4e-06), and the Tukey post-hoc test shows that several, though not all, pairs differ significantly; notably the comparisons involving Nabisco (vs. General Mills, Kelloggs and Ralston Purina), which stands out with markedly lower sodium, as well as Quaker Oats-General Mills and Ralston Purina-Quaker Oats.
Now that we have asked whether there is a significant difference between the 7 manufacturers (turns out the answer is yes), we can calculate the mean sugar and sodium content and plot them.
I do this using the ggplot2 package, together with gganimate to turn the plots into a gif.
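The per-manufacturer means themselves can be computed before any plotting. A minimal base-R sketch on a small made-up sample (the real analysis would use `cereal_tbl`; the numbers below are invented):

```r
## Sketch: computing per-manufacturer mean sugar and sodium,
## i.e. the values the mean plots summarize (toy data, not cereal_tbl)
toy <- data.frame(
  manufacturer = c("Kelloggs", "Kelloggs", "Post", "Post"),
  sugars_g     = c(14, 6, 12, 10),
  sodium_mg    = c(125, 290, 210, 200)
)
means <- aggregate(cbind(sugars_g, sodium_mg) ~ manufacturer,
                   data = toy, FUN = mean)
means
##   manufacturer sugars_g sodium_mg
## 1     Kelloggs       10     207.5
## 2         Post       11     205.0
```

The tidyverse equivalent would be `group_by(manufacturer) %>% summarise(across(c(sugars_g, sodium_mg), mean))`; `aggregate()` is shown here only because it needs no packages.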
sugar_80 <- cereal_tbl %>%
ggplot(aes(x = cereal_name,
y = sugars_g,
fill = cereal_name)) +
geom_bar(stat = "identity") +
labs(title = "Comparing the sugar content of 80 American cereals",
subtitle = "Sugar content is given per serving in grams",
x = "Cereal names",
y = "Sugar (g)") +
coord_flip() +
theme_dark() +
## a separate panel for every manufacturer
facet_wrap(~manufacturer)
## make the graph interactive
ggplotly(sugar_80)
Graph 3: Bar chart of the 80 cereals and their sugar content (g)
sodium_80 <- cereal_tbl %>%
ggplot(aes(x = cereal_name,
y = sodium_mg,
fill = cereal_name)) +
geom_bar(stat = "identity") +
labs(title = "Comparing the sodium content of 80 American cereals",
subtitle = "Sodium content is given per serving in milligrams",
x = "Cereal names",
y = "Sodium (mg)") +
coord_flip() +
theme_dark() +
## if you're brave you can make a separate panel for every manufacturer and their cereal brands
facet_wrap(~manufacturer)
## make the graph interactive
ggplotly(sodium_80)
Graph 4: Bar chart of the 80 cereals and the sodium content (mg)
Huge thanks to Connor Rothschild from {Rothschild (n.d.)}
Gif graphs, or animated graphs, are pretty neat, but just like with any other type of graph we want to compare the two to get a clearer picture of what we’re testing. The prior steps determined the mean sugar and sodium values in cereal per manufacturer. Those mean values have been plotted, but having to scroll up and down to compare them isn’t ideal, so below I put them side by side.
##turn the sugar plot into a gif
sugar_gif <- animate(sugar,
fps = 10,
duration = 25,
width = 550, height = 500,
renderer = gifski_renderer(here::here("images", "sugar.gif")))
##turn the sodium plot into a gif
sodium_gif <- animate(sodium,
fps = 10,
duration = 25,
width = 550,
height = 500,
renderer = gifski_renderer(here::here("images", "sodium.gif")))
##a gif is effectively many frames of an image shown in quick succession
##store each frame of the plot in an object with image_read
sugar_mgif <- image_read(sugar_gif)
sodium_mgif <- image_read(sodium_gif)
##put both gif plots side by side
new_gif <- image_append(c(sugar_mgif[1], sodium_mgif[1]), stack = FALSE)
for (i in 2:length(sugar_mgif)) {
combined <- image_append(c(sugar_mgif[i], sodium_mgif[i]), stack = FALSE)
new_gif <- c(new_gif, combined)
}
##let's have a look
new_gif
Graph 5: Mean sugar and sodium content of the 7 biggest cereal manufacturers
I really love R, namely because you can see the results of your code right away in the form of a graph. However, when you’re dealing with a dataset that has many variables, a simple graph won’t always cut it. While there are few manufacturers in this dataset, those manufacturers make a lot of types of cereal: 80 to be precise.
A graph with 80 bars just isn’t easy to read, and we have already established a great deal of information with the graphs displayed above.
What exactly? Well, that on average manufacturers Post and General Mills produce cereals that contain the highest sugar content per serving, while Post and Ralston Purina produce cereals with the highest sodium content.
Let’s say we want to dive in a little deeper (I know I do): comparing the biggest manufacturers on the market is great, but how about we pull out our magnifying glass and compare all 80 cereals without a plot?
One way of doing this is by writing a custom function in R that asks us for the name of a cereal and returns its sugar and sodium content. The way to do this is by telling R you’re going to filter a column, giving R a name, and having R return the matching values from that column {Rodrigues (2020)}.
Let me show you below how I do that.
## With the heartiest thanks to Bruno Rodrigues, whose bookdown "Modern R with the tidyverse"
## has helped with making this function
##function name
nutrition <- function(dataset, col_name, value){
col_name <- enquo(col_name)
dataset %>%
## My goal is to make a function that will return all nutritional values when I give a cereal name so,
## filter the column that contains the cereal names
filter((!!col_name) == value)
## If getting the entire list of nutrients isn't your thing then,
## summarize to get the mean sugar and sodium value
#%>%
# summarise(mean_sugar = mean(sugars_g),
# mean_sodium = mean(sodium_mg))
}
## Our function name is nutrition
## Give the function the dataset name, column name and lastly the name of the cereal
nutrition(cereal_tbl, cereal_name, "Apple Jacks")
## # A tibble: 1 × 15
## cereal_name manufacturer type kcal protein_g fat_g sodium_mg fiber_g carbs_g
## <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl>
## 1 Apple Jacks Kelloggs Cold 110 2 0 125 1 11
## # … with 6 more variables: sugars_g <int>, potass_mg <int>, vit_perc <int>,
## # shelf <int>, weight_oz <dbl>, cups <dbl>
How seriously should we take these findings?
At the beginning of this analysis the Shapiro-Wilk test determined that the data was not normally distributed, which was not very surprising considering the relatively small number of values we are dealing with. This is still a small set of cereals, and seeing as it’s unclear when the dataset was last updated, I would suggest we take these findings with a pinch of sodium (chloride).